Abstract

This paper contributes a joint embedding model for predicting
relations between a pair of entities in the scenario of relation
inference. It differs from most stand-alone approaches, which operate
separately on either knowledge bases or free texts. The proposed model
simultaneously learns low-dimensional vector representations for both
triplets in knowledge repositories and the mentions of relations in
free texts, so that we can leverage the evidence from both resources
to make more accurate predictions. We use NELL to evaluate the
performance of our approach, compared with cutting-edge methods.
Results of extensive experiments show that our model achieves
significant improvement on relation extraction.


1 Introduction 


Relation extraction (Bach and Badaskar, 2007; Grishman, 1997;
Sarawagi, 2008), which aims at discovering the relationships between a
pair of entities, is a significant research direction for discovering
more beliefs for knowledge bases. Most stand-alone approaches,
however, either use local graph patterns in knowledge repositories, or
extract features from text mentions, to individually help predict
relations between two entities. This heterogeneity brings about a gap
between structured repositories and unstructured free texts, which
prevents sharing the evidence from both knowledge and natural
language.

For decades, researchers have either compared the performance of
their methods on public text datasets such as ACE¹ (GuoDong et al.,
2005) and MUC² (Zelenko et al., 2003), or looked for effective
approaches (Gardner et al., 2013; Lao et al., 2011) to improve the
accuracy of link prediction within knowledge bases such as NELL³
(Carlson et al., 2010) and Freebase⁴ (Bollacker et al., 2007). Thanks
to the research on distantly supervised relation extraction (Fan et
al., 2014a; Mintz et al., 2009), which eases manual annotation by
automatically aligning beliefs with the relation mentions in free
texts, NELL can not only extract triplets, i.e. (head_entity,
relation, tail_entity), but also collect the texts between two
entities as the evidence of relation mention. We take an example from
NELL which originally records a belief: (concept:city:Caroline,
concept:citylocatedinstate, concept:stateorprovince:maryland, County
and State of), where "County and State of" is the mention between the
head entity concept:city:Caroline and the tail entity
concept:stateorprovince:maryland, indicating the relation
concept:citylocatedinstate.


Fortunately, the embedding techniques (Fan et al., 2014b; Mikolov et
al., 2013) enlighten us to break through the limitation of
heterogeneous resources, and to establish a connection between a
relation and its corresponding mention by learning a specific vector
representation for each of the elements, including the entities and
relations in triplets, and the words in mentions. More specifically,
we propose a joint relation mention embedding (JRME) model in this
paper, which simultaneously learns low-dimensional vector
representations for entities and relations in knowledge repositories
while also training a dedicated embedding for each word in the
relation mentions. This model helps us take advantage of both
resources to make more accurate predictions. We use two different
datasets extracted from NELL to evaluate the performance of JRME,
compared with cutting-edge methods. It turns out that our model
achieves significant improvement on relation extraction.

¹ http://www.itl.nist.gov/iad/mig/tests/ace/
² http://www.itl.nist.gov/iaui/894.02/related_projects/muc/
³ http://rtw.ml.cmu.edu/rtw/
⁴ http://www.freebase.com/

2 Related Work 

We group some recent work on relation extraction into two categories,
i.e. text-based approaches and knowledge-based methods. Generally
speaking, both parties seek better evidence to make more accurate
predictions. The text-based community focuses on linguistic features
such as the words combined with POS tags that indicate the relations,
whereas the other side conducts relation inference depending on the
local connecting patterns between entity pairs learnt from the
knowledge graph which is established by beliefs.

2.1 Text-based Approaches 

It is believed that the text between two recognized entities in a
sentence indicates their relationship to some extent. To implement a
relation extraction system guided by supervised learning, a key step
is to annotate the training data. Therefore, two branches emerge as
follows.


• Relation extraction with manually annotated corpora: Traditional
approaches compare performance on public text datasets which are
annotated by experts, such as ACE and MUC. They choose different
features extracted from the texts, like kernel features (Zelenko et
al., 2003) or semantic parser features (GuoDong et al., 2005), and
there is a comprehensive survey (Sarawagi, 2008) which shows more
details about this branch.

• Relation extraction with distant supervision: Due to the limited
scale and tedious labor caused by manual annotation, scientists
explore an alternative way to automatically generate large-scale
annotated corpora, known as distant supervision (Mintz et al., 2009).
Even though this cutting-edge technique solves the issue of lacking
annotated corpora, we still suffer from the problem of noisy and
sparse features (Fan et al., 2014a).


2.2 Knowledge-based Methods

Knowledge bases contain millions of entries which are usually
represented as triplets, i.e. (head_entity, relation, tail_entity),
which intuitively inspire us to regard the whole repository as a
graph, where entities are nodes and relations are edges. Therefore,
one research community aims to predict unknown relations which may
exist between two entities by learning the linking patterns, and
another promising research group tries to learn structured embeddings
of knowledge bases.

• Relation prediction with graph patterns: Some canonical studies
(Gardner et al., 2013; Lao et al., 2011) adopt a data-driven random
walk model, which follows the paths from the head entity to the tail
entity on the local graph structure to generate non-linear feature
combinations to represent relations, and then uses logistic regression
to select the significant features that contribute to classifying
other entity pairs which also have the given relation.

• Relation prediction with embedding representations: Bordes et al.
(Bordes et al., 2013) propose an alternative way that embeds the whole
knowledge graph by learning a specific low-dimensional vector for each
entity and relation, so that simple vector calculations suffice to
predict relations.

Our model (JRME) benefits most from the latest state-of-the-art
embedding approaches, TransE (Bordes et al., 2013) and IIKE (Fan et
al., 2015a). Therefore, we re-implement them as the rival methods, and
conduct extensive comparisons in the subsequent experiments.


3 Model

The heterogeneity between free texts and knowledge bases brings about
a challenge that we can hardly take advantage of the features
uniformly, since they are located in different spaces and have various
dimensions. Thankfully, the embedding techniques (Fan et al., 2014b;
Mikolov et al., 2013; Fan et al., 2015b; Fan et al.) leave an idea
that almost all the elements, including words, entities and relations,
can be learnt and assigned distributed representations, and the
mission remaining for us is to jointly learn embeddings for entities,
relations, and the words in the same feature space.

Figure 1: Given a belief, h: city:Caroline, r: citylocatedinstate,
t: stateorprovince:maryland and m: County and State of in NELL, (a)
shows the distributed representations of a triplet in the knowledge
space, and (b) illustrates word embeddings in the text space.

We arrange the subsequent content as follows: Sections 3.1 and 3.2
describe how to model the knowledge and texts individually, and we
finally present the proposed joint embedding model in Section 3.3.


3.1 Knowledge Relation Embedding

Inspired by TransE (Bordes et al., 2013), we regard the relation r
between a pair of entities, i.e. h and t, as a transition, due to the
hierarchical structure of knowledge graphs. Therefore, we use
D_r(h, r, t) as follows to denote the plausibility of a triplet
(h, r, t), illustrated by Figure 1(a):

    D_r(h, r, t) = \| \mathbf{h} + \mathbf{r} - \mathbf{t} \|,    (1)

where the closer h + r is to t, the more likely the triplet (h, r, t)
exists. The bold fonts indicate the vector representations, e.g. the
embedding of the head entity h is \mathbf{h} \in \mathbb{R}^d, where d
is short for dimension.

Assume that R is the set of relations. Given a correct triplet
(h, r, t), we aim at pushing all the possible corrupt triplets with
wrong relations {r' | r' ∈ R ∧ r' ≠ r} away. Therefore, we adopt a
margin-based ranking loss function with a margin α to separate all the
negative triplets in the corrupted base K' from all the positives in
the correct knowledge base K:

    \arg\min \mathcal{L}_r = \sum_{(h,r,t) \in K} \sum_{(h,r',t) \in K'} [\alpha + D_r(h, r, t) - D_r(h, r', t)]_+,    (2)

in which [ ]_+ is a hinge loss function, i.e. [x]_+ = max(0, x).


3.2 Text Mention Embedding 

Similar to the Knowledge Relation Embedding (KRE), we can also find an
approach to measure the distance between the mention m and its
corresponding relation r in Text Mention Embedding (TME). To denote
the embedding of mention m, we sum the embeddings of all the words
included in m, as shown by Equation (3). Because all the words and
relations are represented as vectors of the same dimension, as
demonstrated by Figure 1(b), we can adopt the inner product function
shown by Equation (4) to calculate their similarity.

    \mathbf{m} = \sum_{w \in m} \mathbf{w},    (3)

    D_m(r, m) = \mathbf{r}^{\top} \mathbf{m}.    (4)
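Equations (3) and (4) can be sketched in a few lines of plain Python.
The toy word vectors, the relation vector and the helper names
mention_embedding and d_m are illustrative assumptions, not values or
code from our experiments.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical 3-dimensional word embeddings (illustrative values only).
WORD_EMB: Dict[str, Vector] = {
    "County": [0.5, 0.0, 1.0],
    "and": [0.0, 0.5, 0.0],
    "State": [1.0, 1.0, 0.0],
    "of": [0.5, 0.5, 0.0],
}


def mention_embedding(words: List[str], emb: Dict[str, Vector]) -> Vector:
    """Equation (3): sum the embeddings of all words in the mention."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(emb[word]):
            total[i] += value
    return total


def d_m(r: Vector, m: Vector) -> float:
    """Equation (4): inner product of relation and mention embeddings."""
    return sum(ri * mi for ri, mi in zip(r, m))


m_vec = mention_embedding("County and State of".split(), WORD_EMB)
r_vec = [1.0, 0.0, 2.0]  # hypothetical embedding of citylocatedinstate
print(m_vec, d_m(r_vec, m_vec))
```

Summing word vectors keeps the mention in the same d-dimensional space
as the relations, which is what makes the inner product well defined.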

Before using the margin-based ranking loss function to learn, we need
to construct the negative set T' for each relation-mention pair (r, m)
which appears in the correct training set T. To generate the negative
pairs (r', m), we keep the mention m but iteratively replace r with
each of the other relations from the set of relations R. The
subsequent Formula (5) helps to discriminate between the two opponent
sets with a margin β,

    \arg\min \mathcal{L}_m = \sum_{(r,m) \in T} \sum_{(r',m) \in T'} [\beta + D_m(r, m) - D_m(r', m)]_+.    (5)
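The construction of negative pairs and the hinge term of Formula (5)
can be illustrated for a single training pair as follows. This is a
minimal sketch: the two extra relation names, the precomputed D_m
values and the function names are hypothetical, and no gradient step
is shown.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (relation, mention)


def negative_pairs(r: str, m: str, relations: List[str]) -> List[Pair]:
    """Keep the mention m and swap in every relation other than r."""
    return [(r_neg, m) for r_neg in relations if r_neg != r]


def hinge(x: float) -> float:
    """[x]_+ = max(0, x)."""
    return max(0.0, x)


def pair_loss(r: str, m: str, relations: List[str],
              d_m: Dict[Pair, float], beta: float) -> float:
    """Inner sum of Formula (5) for one correct pair (r, m)."""
    return sum(
        hinge(beta + d_m[(r, m)] - d_m[neg])
        for neg in negative_pairs(r, m, relations)
    )


MENTION = "County and State of"
# Only the first relation name comes from the text; the others are made up.
RELATIONS = ["citylocatedinstate", "hypothetical_rel_a", "hypothetical_rel_b"]
# Made-up D_m values, standing in for scores from learnt embeddings.
D_M: Dict[Pair, float] = {
    ("citylocatedinstate", MENTION): 0.5,
    ("hypothetical_rel_a", MENTION): 1.0,
    ("hypothetical_rel_b", MENTION): 3.0,
}

loss = pair_loss("citylocatedinstate", MENTION, RELATIONS, D_M, beta=1.0)
print(negative_pairs("citylocatedinstate", MENTION, RELATIONS), loss)
```

Only negatives whose D_m value lies within the margin β of the correct
pair contribute to the loss; the others are clipped to zero by the
hinge.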


3.3 Joint Relation Mention Embedding 

Due to the uniform modeling standard of KRE and TME, we can jointly
embed the relations and corresponding mentions (JRME) with Equation
(6),

    \arg\min \mathcal{L} = \sum_{(h,r,t,m) \in KT} \sum_{(h,r',t,m) \in KT'} [\gamma + D_r(h, r, t) - D_r(h, r', t) + D_m(r, m) - D_m(r', m)]_+,    (6)

in which each belief (h, r, t, m) belonging to the training set KT
contains two entities, the relation and its corresponding mention.

Once we have learnt embeddings for all the entities, relations and
words in mentions, we can simply use Equation (7) to measure the
rationality of a relation r appearing between a pair of entities h, t
with the evidence of m:

    Score(h, r, t, m) = D_r(h, r, t) + D_m(r, m).    (7)
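As a worked illustration of Equations (1), (4) and (7), the sketch
below scores one candidate relation. The L1 norm is an assumption
(Equation (1) does not state which norm is used), and the
2-dimensional vectors are made-up toy values rather than learnt
embeddings.

```python
from typing import List

Vector = List[float]


def d_r(h: Vector, r: Vector, t: Vector) -> float:
    """Equation (1) with an assumed L1 norm: ||h + r - t||_1."""
    return sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))


def d_m(r: Vector, m: Vector) -> float:
    """Equation (4): inner product of relation and mention vectors."""
    return sum(ri * mi for ri, mi in zip(r, m))


def score(h: Vector, r: Vector, t: Vector, m: Vector) -> float:
    """Equation (7): Score(h, r, t, m) = D_r(h, r, t) + D_m(r, m)."""
    return d_r(h, r, t) + d_m(r, m)


h = [1.0, 0.0]  # toy head entity embedding
t = [1.5, 2.0]  # toy tail entity embedding
m = [2.0, 0.5]  # toy mention embedding (sum of word vectors)
r = [0.5, 1.0]  # toy relation embedding

print(d_r(h, r, t), d_m(r, m), score(h, r, t, m))
```

In a full system the same function would be evaluated once per
candidate relation, with h, t and m held fixed.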

4 Experiments 

We set up three objectives for evaluating the effec¬ 
tiveness of JRME, which are: 


• testing the effectiveness of JRME in terms of 
different evaluation protocols/metrics; 

• comparing the performance of JRME with
other cutting-edge approaches;

• judging the robustness of the proposed model 
by using a larger but noisy dataset. 


Sections 4.1 and 4.2 describe the datasets and the protocols we use to
measure the performance compared with several state-of-the-art
approaches, i.e. TransE (Bordes et al., 2013) and IIKE (Fan et al.,
2015a). Section 4.3 lists the hyperparameters, and Section 4.4 shows
the results of the extensive experiments.


DATASET              NELL-50K     NELL-5M
#(ENTITIES)            29,904     177,635
#(RELATIONS)              233         236
#(TRAINING EX.)        57,356   5,000,000
#(VALIDATING EX.)      10,710      47,335
#(TESTING EX.)         10,711      47,335

Table 1: Statistics of the datasets used for relation
prediction task.


4.1 Datasets 


We prepare two datasets with different statistical characteristics. As
illustrated by Table 1, both of them are generated by NELL (Carlson et
al., 2010), a Never-Ending Language Learner which works on
automatically extracting beliefs from the Web. NELL-50K is a
medium-size dataset, and each belief, which contains the head entity
h, the tail entity t, the relation r between them, and the mention m
indicating the relation, is validated by experts. In contrast, NELL-5M
is a much larger one with five million uncertain training examples
automatically learnt from the Web by NELL.


4.2 Protocols 

The experimental scenario is as follows: given a pair of entities, a
short text/mention indicating the correct relation, and a set of
candidate relations, we compare the performance of our models with
other state-of-the-art approaches, using the following metrics:

• Average Rank: Each candidate relation gains a score calculated by
Equation (7). We sort the candidates in ascending order of score and
compare them with the corresponding ground-truth belief. For each
belief in the testing set, we get the rank of the correct relation.
The average rank is an aggregate indicator, to some extent, of the
overall performance of an approach on relation extraction.

• Hit@10: Besides the average rank, practitioners in industry care
more about the accuracy of extraction when selecting the Top 10
relations. This metric shows the proportion of beliefs for which the
correct relation is ranked in the Top 10.

• Hit@1: It is a stricter metric that automatic systems can rely on,
since it shows the accuracy when just picking the first predicted
relation in the sorted list.
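The three metrics can be computed from per-belief score tables as
sketched below. Candidates are sorted in ascending order of score, as
described above; the relation names and scores are invented for
illustration, ties are broken by insertion order (an assumption), and
k is set to 2 only because the toy candidate sets are small.

```python
from typing import Dict, List, Tuple

Belief = Tuple[Dict[str, float], str]  # (candidate scores, correct relation)


def rank_of_truth(scores: Dict[str, float], truth: str) -> int:
    """1-based rank of the correct relation after an ascending sort."""
    ordered = sorted(scores, key=lambda rel: scores[rel])
    return ordered.index(truth) + 1


def evaluate(beliefs: List[Belief], k: int = 10) -> Tuple[float, float, float]:
    """Return (average rank, Hit@k, Hit@1) over the test beliefs."""
    ranks = [rank_of_truth(scores, truth) for scores, truth in beliefs]
    n = len(ranks)
    avg_rank = sum(ranks) / n
    hit_k = sum(rank <= k for rank in ranks) / n
    hit_1 = sum(rank == 1 for rank in ranks) / n
    return avg_rank, hit_k, hit_1


# Two toy test beliefs with made-up scores.
TEST: List[Belief] = [
    ({"rel_a": 0.2, "rel_b": 1.4, "rel_c": 0.9}, "rel_a"),
    ({"rel_a": 2.0, "rel_b": 0.1, "rel_c": 0.7}, "rel_c"),
]
print(evaluate(TEST, k=2))
```

Average rank is lower-is-better, while Hit@10 and Hit@1 are
higher-is-better, which is how Tables 2 and 3 should be read.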













APPROACH   AVG. R.   HIT@10   HIT@1
TransE       131.8    16.3%    3.0%
KRE           29.1    44.3%   14.4%
TME           11.5    80.0%   56.0%
IIKE           7.5    81.8%   56.8%
JRME           6.2    87.8%   60.2%

Table 2: Performance of TransE, KRE, IIKE,
TME and JRME on the metrics of Average Rank,
Hit@10 and Hit@1 in NELL-50K dataset.


APPROACH   AVG. R.   HIT@10   HIT@1
TransE        11.\     5.4%    0.7%
KRE           57.5    17.9%    2.5%
TME            3.6    96.3%   63.6%
IIKE           4.5    82.6%   53.2%
JRME           3.0    96.7%   68.0%

Table 3: Performance of TransE, KRE, IIKE,
TME and JRME on the metrics of Average Rank,
Hit@10 and Hit@1 in NELL-5M dataset.


4.3 Hyperparameters 

Before displaying the evaluation results, we need to describe the
hyperparameters that have been tried, and show the best combination we
choose. Another advantage of embedding-based models is that it is
unnecessary to tune many hyperparameters. For our model, we just need
to set four, which are the uniform dimension d of entities, relations
and the words in mentions, the margin α of KRE, the margin β of TME
and the margin γ of JRME. To decide the ideal set of hyperparameters,
we use the validation set to pick the best combination from
d ∈ {10, 20, 50, 100, 200}, α ∈ {0.1, 1.0, 2.0, 5.0, 10.0},
β ∈ {0.1, 1.0, 2.0, 5.0, 10.0} and γ ∈ {0.1, 1.0, 2.0, 5.0, 10.0}.
Finally, we choose d = 100, α = 1.0, β = 1.0 and γ = 2.0 to train the
embeddings, as this combination performs best on the validation set.
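The search space described above contains 5 × 5 × 5 × 5 = 625
combinations. Its enumeration can be sketched as follows. The function
dummy_objective is a hypothetical stand-in for training a model and
measuring its average rank on the validation set; it is defined only
so that the sketch runs and does not reproduce any validation result.

```python
from itertools import product
from typing import Callable, List, Tuple

Config = Tuple[int, float, float, float]  # (d, alpha, beta, gamma)

DIMS = [10, 20, 50, 100, 200]
MARGINS = [0.1, 1.0, 2.0, 5.0, 10.0]  # candidate set shared by all margins


def grid() -> List[Config]:
    """Enumerate every (d, alpha, beta, gamma) combination."""
    return list(product(DIMS, MARGINS, MARGINS, MARGINS))


def select_best(configs: List[Config],
                objective: Callable[[Config], float]) -> Config:
    """Pick the configuration with the lowest objective value."""
    return min(configs, key=objective)


def dummy_objective(config: Config) -> float:
    """Stand-in for validation average rank (NOT a real measurement)."""
    d, alpha, beta, gamma = config
    return abs(d - 100) + abs(alpha - 1.0) + abs(beta - 1.0) + abs(gamma - 2.0)


configs = grid()
best = select_best(configs, dummy_objective)
print(len(configs), best)
```

In practice each call to the objective means one full training run, so
the cost of the search grows with the product of the candidate set
sizes.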


4.4 Performance 

Tables 2 and 3 show the results of experiments on NELL-50K and
NELL-5M, respectively. Both of them show that JRME performs best among
all the approaches we implemented. We can also see that text mentions
contribute a lot to predicting the correct relations. Moreover, Table
3 demonstrates that not only is IIKE robust to the noise in the
NELL-5M dataset, which is consistent with its characteristics
emphasized by Fan et al. (Fan et al., 2015a), but TME and JRME also
share this special "gene". Overall, JRME improves the average rank of
relation prediction by about 20% compared with the state-of-the-art
IIKE.


5 Conclusion 


We engage in bridging the gap between unstructured free texts and
structured knowledge bases to predict more accurate relations between
any given entity pair, by proposing a joint embedding model for
knowledge population. The results of extensive experiments with
various evaluation protocols on both medium and large NELL datasets
demonstrate that our model (JRME) outperforms other state-of-the-art
approaches. Because of the uniform low-dimensional vector
representations for entities, relations and even the words, evidence
for prediction is compressed into embeddings to facilitate information
exchange and computing, which finally leads to a huge leap forward in
relation extraction.

There still remain, however, several open questions in this promising
research direction, such as exploring better ways to embed whole
beliefs or mentions without losing too many regularities of knowledge
and linguistics.

Acknowledgments 

The first author conducted this research while he was a
joint-supervision Ph.D. student in New York University. This paper is
dedicated to all the members of the Proteus Project.


References 

[Bach and Badaskar2007] Nguyen Bach and Sameer
Badaskar. 2007. A review of relation extraction.
Literature review for Language and Statistics II.

[Bollacker et al.2007] Kurt Bollacker, Robert Cook,
and Patrick Tufts. 2007. Freebase: A shared
database of structured general human knowledge. In
AAAI, volume 7, pages 1962-1963.

[Bordes et al.2013] Antoine Bordes, Nicolas Usunier,
Alberto Garcia-Duran, Jason Weston, and Oksana
Yakhnenko. 2013. Translating embeddings for
modeling multi-relational data. In Advances in
Neural Information Processing Systems, pages
2787-2795.

















[Carlson et al.2010] Andrew Carlson, Justin Betteridge,
Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr.,
and Tom M. Mitchell. 2010. Toward an architecture
for never-ending language learning. In Proceedings
of the Twenty-Fourth Conference on Artificial
Intelligence (AAAI 2010).

[Fan et al.] Miao Fan, Qiang Zhou, Andrew Abel, and
Thomas Fang Zheng. Probabilistic belief embedding
for large-scale knowledge population.

[Fan et al.2014a] Miao Fan, Deli Zhao, Qiang Zhou,
Zhiyuan Liu, Thomas Fang Zheng, and Edward Y.
Chang. 2014a. Distant supervision for relation
extraction with matrix completion. In Proceedings
of the 52nd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 839-849, Baltimore, Maryland, June.
Association for Computational Linguistics.

[Fan et al.2014b] Miao Fan, Qiang Zhou, Emily Chang,
and Thomas Fang Zheng. 2014b. Transition-based
knowledge graph embedding with relational mapping
properties. In Proceedings of the 28th Pacific
Asia Conference on Language, Information, and
Computation, pages 328-337, Phuket, Thailand,
December. Department of Linguistics, Chulalongkorn
University.

[Fan et al.2015a] Miao Fan, Qiang Zhou, and
Thomas Fang Zheng. 2015a. Learning embedding
representations for knowledge inference
on imperfect and incomplete repositories. arXiv
preprint arXiv:1503.08155.

[Fan et al.2015b] Miao Fan, Qiang Zhou, Thomas Fang
Zheng, and Ralph Grishman. 2015b. Probabilistic
belief embedding for knowledge base completion.
arXiv preprint arXiv:1505.02433.

[Gardner et al.2013] Matt Gardner, Partha Pratim
Talukdar, Bryan Kisiel, and Tom M. Mitchell.
2013. Improving learning and inference in a large
knowledge-base using latent syntactic cues. In
EMNLP, pages 833-838. ACL.

[Grishman1997] Ralph Grishman. 1997. Information
extraction: Techniques and challenges. In
International Summer School on Information Extraction: A
Multidisciplinary Approach to an Emerging
Information Technology, SCIE '97, pages 10-27,
London, UK, UK. Springer-Verlag.

[GuoDong et al.2005] Zhou GuoDong, Su Jian, Zhang
Jie, and Zhang Min. 2005. Exploring various
knowledge in relation extraction. In Proceedings of
the 43rd Annual Meeting on Association for
Computational Linguistics, ACL '05, pages 427-434,
Stroudsburg, PA, USA. Association for
Computational Linguistics.

[Lao et al.2011] Ni Lao, Tom Mitchell, and William W.
Cohen. 2011. Random walk inference and learning
in a large scale knowledge base. In Proceedings
of the 2011 Conference on Empirical Methods in
Natural Language Processing, pages 529-539,
Edinburgh, Scotland, UK., July. Association for
Computational Linguistics.

[Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg
Corrado, and Jeffrey Dean. 2013. Efficient
estimation of word representations in vector space. arXiv
preprint arXiv:1301.3781.

[Mintz et al.2009] Mike Mintz, Steven Bills, Rion
Snow, and Dan Jurafsky. 2009. Distant supervision
for relation extraction without labeled data. In
Proceedings of the Joint Conference of the 47th Annual
Meeting of the ACL and the 4th International Joint
Conference on Natural Language Processing of the
AFNLP: Volume 2-Volume 2, pages 1003-1011.
Association for Computational Linguistics.

[Sarawagi2008] Sunita Sarawagi. 2008. Information
extraction. Foundations and trends in databases,
1(3):261-377.

[Zelenko et al.2003] Dmitry Zelenko, Chinatsu Aone,
and Anthony Richardella. 2003. Kernel methods for
relation extraction. The Journal of Machine
Learning Research, 3:1083-1106.



